In Mecklenburg County, the housing market shows good trend recent years. Housing Price Prediction is necessary and helpful. The hedonic model is a theoretical framework for predicting home prices by deconstructing house price into the value of its constituent parts.
Q: What makes this a difficult exercise? A: (We have a lot to say…) How to choose a good independent variable / We met a lot of errors when we transfer a numerical variable to a categorical variable/ Random error in every trunks…Fortunately! We finally did it.
Q: What is your overall modeling strategy? A: We first make OLS predictions according to the original dataset, and then filter the predictions according to the performance of the r square and MAPE to continuously optimize our model.
Q: Briefly summarize your results. A: Just like our team name – SUPER MODEL, which means our regression model will be a super model to predict the price!
Let’s just have a look! We try to make model accurate!
mapTheme <- function(base_size = 50) {
theme(
text = element_text(color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=1),
legend.background = element_blank(),
legend.key = element_blank(),
legend.position = "right"
)
}
There are several way to collect data. Our main source is an official website from Mecklenburg County offering a large amount of data including traffic transit, parks, churchs, colleges, neighborhoods, and so on.
The second source is census tract data which can be accessed with
tidycensus package in R.
studentdata <-
st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/studentData.geojson") %>%
st_transform('ESRI:102719')
## Reading layer `studentData' from data source
## `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/studentData.geojson'
## using driver `GeoJSON'
## Simple feature collection with 46183 features and 71 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -81.05626 ymin: 35.01345 xmax: -80.55941 ymax: 35.51012
## Geodetic CRS: WGS 84
data <-
studentdata[,c('stname','zipcode','price','sale_year','storyheigh','fullbaths','halfbaths','bedrooms','units','yearbuilt','heatedarea','heatedfuel','foundation','bldggrade','shape_Leng','shape_Area','musaID','toPredict','geometry')]
collegedata <-
st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/colleges/Colleges.shp") %>%
st_transform('ESRI:102719')
## Reading layer `Colleges' from data source
## `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/colleges/Colleges.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 25 features and 11 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 1407818 ymin: 481350 xmax: 1493233 ymax: 642582.4
## Projected CRS: NAD83 / North Carolina (ftUS)
parkdata <-
st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/park_locations/Park_Locations.shp") %>%
st_transform('ESRI:102719')
## Reading layer `Park_Locations' from data source
## `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/park_locations/Park_Locations.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 351 features and 97 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 1394440 ymin: 467223.2 xmax: 1524300 ymax: 644298.8
## Projected CRS: NAD83 / North Carolina (ftUS)
churchdata <-
st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/churches/Churches.shp") %>%
st_transform('ESRI:102719')
## Reading layer `Churches' from data source
## `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/churches/Churches.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 880 features and 44 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 1392478 ymin: 466634 xmax: 1528037 ymax: 644203
## Projected CRS: NAD83 / North Carolina (ftUS)
tfdata <-
st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/ncdot_trafficestimates/NCDOT_TrafficEstimates.shp") %>%
st_transform('ESRI:102719')
## Reading layer `NCDOT_TrafficEstimates' from data source
## `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/ncdot_trafficestimates/NCDOT_TrafficEstimates.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 2064 features and 6 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 1373587 ymin: 448274.1 xmax: 1553672 ymax: 667493
## Projected CRS: NAD83 / North Carolina (ftUS)
bd <-
st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/npa/npa.shp") %>%
st_transform('ESRI:102719')
## Reading layer `npa' from data source
## `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/npa/npa.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 458 features and 4 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 1384164 ymin: 460652.7 xmax: 1536927 ymax: 647732.9
## Projected CRS: NAD83 / North Carolina (ftUS)
college <- select(collegedata, SchoolName)
church <- select(churchdata, address)
park <- select(parkdata, prkname)
transit <- select(tfdata, county, location)
transit <- transit[transit$county=="MECKLENBURG",]
st_c <- st_coordinates
bd.sf <-
bd %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102719') %>%
distinct()
data.sf <-
data %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102719') %>%
distinct()
college.sf <-
college %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102719') %>%
distinct()
church.sf <-
church %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102719') %>%
distinct()
park.sf <-
park %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102719') %>%
distinct()
transit.sf <-
transit %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102719') %>%
distinct()
In our model, the dependent variable is house sale price.
In addition, there are 3 main factors:
1.Interal characteristics
2.Amenities of decision factor
3.Spatial structure
First thing we need to do is to score yearbuilt with quantile.
divyear <-
data.sf$yearbuilt
pctdivyear <-
quantile(divyear,c(0,0.25,0.50,0.75,1))
data.sf$year = 0
data.sf$year[which(data.sf$yearbuilt < 1979 )] =1
data.sf$year[which((data.sf$yearbuilt >= 1979)&(data.sf$yearbuilt < 1999))] =2
data.sf$year[which((data.sf$yearbuilt >= 1999)&(data.sf$yearbuilt < 2015))] =3
data.sf$year[which((data.sf$yearbuilt >= 2015)&(data.sf$yearbuilt < 2022))] =4
data.sf <- select(data.sf, -c(yearbuilt))
Second, we need to calculates the ‘average nearest neighbor distance’ from each home sale to its k nearest neighbor transit stop, college, church and park. A generalizable model is also one that predicts with comparable accuracy across different groups. We predicted each of these features are correlated with house price.
bd.sf <-bd.sf[,c('id')]
alldata <- st_join(data.sf, bd.sf, join = st_intersects)
alldata$id <- as.factor(alldata$id)
st_c <- st_coordinates
data.sf <-
data.sf %>%
mutate(
college_nn1 = nn_function(st_c(data.sf), st_c(college.sf), 1),
church_nn1 = nn_function(st_c(data.sf), st_c(church.sf), 1),
park_nn2 = nn_function(st_c(data.sf), st_c(park.sf), 2),
transit_nn1 = nn_function(st_c(data.sf), st_c(transit.sf), 1),
transit_nn2 = nn_function(st_c(data.sf), st_c(transit.sf), 2),
transit_nn3 = nn_function(st_c(data.sf), st_c(transit.sf), 3),
transit_nn4 = nn_function(st_c(data.sf), st_c(transit.sf), 4),
transit_nn5 = nn_function(st_c(data.sf), st_c(transit.sf), 5),
)
map2 <- ggplot() + geom_sf(data = bd.sf, fill = "grey85", color = "white", size = 0.2) +
stat_density2d(data = data.frame(st_coordinates(college.sf)),
aes(X, Y, fill = ..level.., alpha = ..level..),
bins = 50, geom = 'polygon') +
scale_fill_gradient(low = "#5252B8", high = "#D0A561", name = "Density") +
scale_alpha(range = c(0.00, 0.3), guide = FALSE) +
labs(title = "Density of College") +
mapTheme()
map3 <- ggplot() + geom_sf(data = bd.sf, fill = "grey85", color = "white", size = 0.2) +
stat_density2d(data = data.frame(st_coordinates(transit.sf)),
aes(X, Y, fill = ..level.., alpha = ..level..),
bins = 50, geom = 'polygon') +
scale_fill_gradient(low = "#5252B8", high = "#D0A561", name = "Density") +
scale_alpha(range = c(0.00, 0.3), guide = FALSE) +
labs(title = "Density of Transit") +
mapTheme()
map4 <- ggplot() + geom_sf(data = bd.sf, fill = "grey85", color = "white", size = 0.2) +
stat_density2d(data = data.frame(st_coordinates(park.sf)),
aes(X, Y, fill = ..level.., alpha = ..level..),
bins = 30, geom = 'polygon') +
scale_fill_gradient(low = "#5252B8", high = "#D0A561", name = "Density") +
scale_alpha(range = c(0.00, 0.3), guide = FALSE) +
labs(title = "Density of Park") +
mapTheme()
map5 <- ggplot() + geom_sf(data = bd.sf, fill = "grey85", color = "white", size = 0.2) +
stat_density2d(data = data.frame(st_coordinates(church.sf)),
aes(X, Y, fill = ..level.., alpha = ..level..),
bins = 30, geom = 'polygon') +
scale_fill_gradient(low = "#5252B8", high = "#D0A561", name = "Density") +
scale_alpha(range = c(0.00, 0.3), guide = FALSE) +
labs(title = "Density of Church") +
mapTheme()
ggarrange(map2, map3, map4, map5, ncol = 2, nrow = 2)
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
In order to ensure that the model results are more fit, we did data type transformation here, such as character type “storyheigh”, “heatedfuel”, “foundation”,“bldggrade” etc. We changed their type into categorical type.
data.sf$year <- as.factor(data.sf$year)
data.sf$storyheigh <- as.factor(data.sf$storyheigh)
data.sf$heatedfuel <- as.factor(data.sf$heatedfuel)
data.sf$foundation <- as.factor(data.sf$foundation)
data.sf$bldggrade <- as.factor(data.sf$bldggrade)
census_api_key("a9e713a06a0a0f8ec8531e047c9d01e7d9f507d9", overwrite = TRUE, install= TRUE)
## Your original .Renviron will be backed up and stored in your R HOME directory if needed.
## Your API key has been stored in your .Renviron and can be accessed by Sys.getenv("CENSUS_API_KEY").
## To use now, restart R or run `readRenviron("~/.Renviron")`
## [1] "a9e713a06a0a0f8ec8531e047c9d01e7d9f507d9"
census_vars <- c( "B01001_001E", # ACS total Pop estimate
"B25002_001E", # Estimate of total housing units
"B25002_003E", # Number of vacant housing units
"B19013_001E", # Median HH Income ($)
"B02001_002E", # People describing themselves as "white alone"
"B06009_005E", # TotalBachelor's degree
"B06009_006E", # Total graduate or professional degree
"B06012_002E", # Number of below 100 percent of the poverty level
"B23025_002E", # Number of people in labor force
"B23025_005E") # Number of people in labor force unemployed
# get access to census data of 2020
acsTracts20 <- get_acs(geography = "tract",
year = 2020,
variables = census_vars,
geometry = TRUE,
state = 37,
county = 119,
output = "wide") %>%
dplyr::select(NAME, NAME,all_of(census_vars)) %>%
rename (pop = B01001_001E,
hu = B25002_001E,
vacant = B25002_003E,
med_hh_income = B19013_001E,
white = B02001_002E,
bachdeg = B06009_005E,
graddeg = B06009_006E,
belpov100 = B06012_002E,
labor_f = B23025_002E,
labor_f_unemployed = B23025_005E) %>%
mutate(pctvacant = vacant/hu,
pctwhite = white/pop,
pctunemployed = labor_f_unemployed/labor_f,
pctbach = bachdeg/pop,
pctgrad = graddeg/pop,
raceContext = ifelse(pctwhite > .5, "Majority White", "Majority Non-White")
) %>%
dplyr::select(-vacant,-hu,-white,-labor_f_unemployed,-labor_f,-bachdeg,-graddeg) %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102719') %>%
distinct()
## Getting data from the 2016-2020 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
##
|
| | 0%
|
| | 1%
|
|= | 2%
|
|== | 2%
|
|== | 3%
|
|=== | 4%
|
|=== | 5%
|
|==== | 5%
|
|==== | 6%
|
|===== | 7%
|
|===== | 8%
|
|====== | 8%
|
|====== | 9%
|
|======= | 9%
|
|======= | 10%
|
|======= | 11%
|
|========= | 12%
|
|========== | 14%
|
|============ | 18%
|
|============== | 20%
|
|=============== | 22%
|
|================ | 22%
|
|==================== | 29%
|
|===================== | 30%
|
|======================= | 33%
|
|======================== | 34%
|
|======================== | 35%
|
|========================== | 37%
|
|=========================== | 39%
|
|============================ | 40%
|
|============================= | 42%
|
|============================== | 43%
|
|=============================== | 44%
|
|=============================== | 45%
|
|================================ | 45%
|
|================================ | 46%
|
|================================= | 47%
|
|================================== | 49%
|
|==================================== | 51%
|
|===================================== | 52%
|
|====================================== | 54%
|
|======================================= | 56%
|
|========================================= | 58%
|
|========================================= | 59%
|
|=========================================== | 62%
|
|============================================ | 63%
|
|============================================== | 66%
|
|=============================================== | 68%
|
|====================================================== | 77%
|
|======================================================= | 79%
|
|========================================================= | 81%
|
|========================================================== | 83%
|
|=========================================================== | 84%
|
|=========================================================== | 85%
|
|============================================================= | 87%
|
|============================================================== | 88%
|
|=============================================================== | 90%
|
|=============================================================== | 91%
|
|================================================================ | 91%
|
|================================================================= | 93%
|
|================================================================== | 95%
|
|=================================================================== | 95%
|
|==================================================================== | 97%
|
|===================================================================== | 98%
|
|===================================================================== | 99%
|
|======================================================================| 100%
Add census data to our dataset
alldata <- st_join(data.sf, acsTracts20, join = st_intersects) %>% na.omit()
alldata.sf <-
alldata %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102179')
We split the dataset into a training and test set by select
“MODELLING” and “CHALLENGE”, then summarize the data with a polished
table of our training data set by stargazer .
modelling <- alldata.sf[alldata.sf$toPredict == "MODELLING" ,]
challenge <- alldata.sf[alldata.sf$toPredict == "CHALLENGE" ,]
md <- st_drop_geometry(modelling)
stargazer(md,
type="text",
title = "Main Categories of Decision Factor: Internal Characteristics",
keep=c("fullbaths","halfbaths","bedrooms","units","heatedarea","shape_Leng","shape_Area"),
covariate.labels = c("Number of Fullbaths", "Number of Halfbaths", "Number of Bedrooms", "Number of Unts", "Heated Area", "Length of Shape", "Area of Shape"))
##
## Main Categories of Decision Factor: Internal Characteristics
## ========================================================================
## Statistic N Mean St. Dev. Min Max
## ------------------------------------------------------------------------
## Number of Fullbaths 46,009 2.282 0.825 0 9
## Number of Halfbaths 46,009 0.596 0.528 0 4
## Number of Bedrooms 46,009 3.510 0.942 0 65
## Number of Unts 46,009 0.981 0.971 0 205
## Heated Area 46,009 2,359.218 1,060.543 0.000 14,710.000
## Length of Shape 46,009 504.231 253.156 152.910 10,798.810
## Area of Shape 46,009 15,876.130 35,083.170 1,139.637 3,486,865.000
## ------------------------------------------------------------------------
stargazer(md,
type="text",
title = "Main Categories of Decision Factor: Amenities and public Services",
keep=c("college_nn1","church_nn1","park_nn2","transit_nn3"),
covariate.labels = c("Average Distance Between Each House and 1 Nearest College", "Average Distance Between Each House 1 Nearest Church", "Average Distance Between Each House 2 Nearest Parks", "Average Distance Between Each House 3 Nearest Transit Stops"))
##
## Main Categories of Decision Factor: Amenities and public Services
## ==========================================================================================================
## Statistic N Mean St. Dev. Min Max
## ----------------------------------------------------------------------------------------------------------
## Average Distance Between Each House and 1 Nearest College 46,009 16,774.630 8,982.394 459.933 54,413.280
## Average Distance Between Each House 1 Nearest Church 46,009 2,932.255 2,097.341 17.465 14,635.710
## Average Distance Between Each House 2 Nearest Parks 46,009 4,941.578 2,574.794 205.557 16,614.240
## Average Distance Between Each House 3 Nearest Transit Stops 46,009 3,013.649 1,405.207 230.756 12,735.250
## ----------------------------------------------------------------------------------------------------------
stargazer(md,
type="text",
title = "Main Categories of Decision Factor: Spatial Structure",
keep=c("pop","med_hh_income","belpov100 ","pctvacant","pctunemployed","pctbach","pctgrad"),
covariate.labels = c("Total Population 2020", "Median Household Income", "Number of Below 100% of the Poverty Level", "Percent of Vacant Housing Units","Percent of White People", "Percent of Unemployed Labor Force", "Percent of Bachelor's Degree", "Percent of Graduate"))
##
## Main Categories of Decision Factor: Spatial Structure
## =====================================================================================
## Statistic N Mean St. Dev. Min Max
## -------------------------------------------------------------------------------------
## Total Population 2020 46,009 4,065.882 1,310.872 790 7,703
## Median Household Income 46,009 86,187.900 37,427.110 17,685 238,750
## Number of Below 100% of the Poverty Level 46,009 0.058 0.043 0.000 0.293
## Percent of Vacant Housing Units 46,009 0.046 0.039 0.000 0.225
## Percent of White People 46,009 0.214 0.089 0.018 0.460
## Percent of Unemployed Labor Force 46,009 0.112 0.065 0.004 0.322
## -------------------------------------------------------------------------------------
facvar <-
select_if(st_drop_geometry(modelling), is.factor) %>% na.omit()
numvar <-
select_if(st_drop_geometry(modelling), is.numeric) %>% na.omit()
We use bar plots to visualize our categorical data. In this way, we can see the specific categories of categorical data and also their average price distribution.
catbar <- st_drop_geometry(modelling)
catbar %>%
dplyr::select(price, storyheigh, fullbaths, halfbaths, bedrooms, units, heatedfuel, foundation, bldggrade, year) %>%
filter(price <= 1000000) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_bar(width = 0.5, position = "dodge", stat = "summary", fun.y = "mean", color = "#6A6ACD", fill = "#d5d5e6", size = 0.5) +
facet_wrap(~Variable, ncol = 5, scales = "free") +
labs(title = "Price as a function of categorical variables", y = "Mean Price") +
theme_minimal() +
theme(axis.text.x = element_text(size = 5, angle = 45, hjust = 1),
axis.text.y = element_text(size = 6, hjust = 1))
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Ignoring unknown parameters: fun.y
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
In the histograms of numeric variables, it can be concluded the distribution of the number of the variables. These sample data are relatively concentrated.
# plot histograms of numeric variables
ggplot(gather(numvar), aes(value)) +
geom_histogram(color = "#6A6ACD", fill = "#d5d5e6", size = 0.4, bins = 10) +
facet_wrap(~key, scales = 'free_x') +
theme_minimal() +
scale_color_brewer(palette="Paired") +
theme(axis.text.x = element_text(size = 4, hjust = 0.5),
axis.text.y = element_text(size = 5, hjust = 1))
A correlation matrix gives us the pairwise correlation of each set of features in our data. We add the each predictors’R square of correlation matrix in the plot. Our analysis of pairwise correlations between these predictors helps us assess the degree of association between these predictors.
ggcorrplot(
hc.order = TRUE,
outline.color = "white",
round(cor(numvar), 1),
p.mat = cor_pmat(numvar),
colors = c("#5252B8", "white", "#D0A561"),
show.diag = TRUE,
lab = TRUE,
lab_size = 1.5,
lab_col = "grey40",
insig = "blank",
legend.title = "Correlation",
ggtheme = ggplot2::theme_gray) +
labs(title = "Correlation Matrix Across Numeric Variables") +
theme(plot.title = element_text(size = 15, hjust = 1),
axis.text.x = element_text(size = 6, hjust = 1),
axis.text.y = element_text(size = 6, hjust = 1)
)
cormax <- cor(numvar)
cormax.df <- as.data.frame(t(cormax))
cormax.df
## price sale_year fullbaths halfbaths bedrooms
## price 1.000000000 0.079026192 0.34011212 0.131834671 0.235757783
## sale_year 0.079026192 1.000000000 -0.07128846 -0.022490794 -0.062591236
## fullbaths 0.340112116 -0.071288464 1.00000000 0.086990730 0.590642509
## halfbaths 0.131834671 -0.022490794 0.08699073 1.000000000 0.219731603
## bedrooms 0.235757783 -0.062591236 0.59064251 0.219731603 1.000000000
## units 0.564840328 -0.030897889 -0.01128456 -0.011204918 0.038262583
## heatedarea 0.410945955 -0.060922192 0.78107220 0.363058899 0.599006145
## shape_Leng 0.207822855 0.005259271 0.10991491 -0.020177353 0.064277882
## shape_Area 0.241012232 0.011950949 0.05509760 -0.009845004 0.034121234
## musaID 0.070050667 -0.001842199 0.09194938 0.003231425 0.059194650
## college_nn1 -0.045648779 -0.002293618 0.12748936 0.052947081 0.087449983
## church_nn1 0.039079512 -0.025614290 0.25297012 0.118962184 0.195238872
## park_nn2 0.001701844 -0.008252172 0.14337608 0.084092197 0.122577698
## transit_nn1 -0.017128469 -0.015827092 0.13570166 0.085266889 0.118865871
## transit_nn2 -0.008168584 -0.021121464 0.16191733 0.097075909 0.133685351
## transit_nn3 -0.004303770 -0.023131241 0.18143453 0.106947987 0.147305722
## transit_nn4 -0.004365708 -0.024820586 0.19489647 0.110088164 0.157061637
## transit_nn5 -0.003995815 -0.025947369 0.20479166 0.113094534 0.164098332
## pop -0.046395963 0.002708378 -0.01539407 0.046484323 -0.008843498
## med_hh_income 0.274841821 -0.046078510 0.43891142 0.194989126 0.320192021
## belpov100 -0.154080225 0.030067766 -0.31532345 -0.136016727 -0.254836472
## pctvacant 0.062512356 0.012222420 -0.09642390 -0.099033912 -0.108223633
## pctwhite 0.260629526 -0.045237704 0.36817661 0.121605121 0.257981171
## pctunemployed -0.096057104 0.026845805 -0.13854434 -0.087456493 -0.097317435
## pctbach 0.256380828 -0.044373549 0.36170220 0.145187909 0.248258960
## pctgrad 0.293630205 -0.041623920 0.37459292 0.132508751 0.259539491
## units heatedarea shape_Leng shape_Area musaID
## price 0.5648403283 0.41094596 0.207822855 0.241012232 0.0700506669
## sale_year -0.0308978894 -0.06092219 0.005259271 0.011950949 -0.0018421989
## fullbaths -0.0112845578 0.78107220 0.109914908 0.055097601 0.0919493834
## halfbaths -0.0112049183 0.36305890 -0.020177353 -0.009845004 0.0032314254
## bedrooms 0.0382625827 0.59900615 0.064277882 0.034121234 0.0591946497
## units 1.0000000000 -0.01647143 0.072437883 0.131219381 -0.0002289342
## heatedarea -0.0164714296 1.00000000 0.196317188 0.109066069 0.1082989004
## shape_Leng 0.0724378835 0.19631719 1.000000000 0.816170993 0.0764059148
## shape_Area 0.1312193814 0.10906607 0.816170993 1.000000000 0.0287585953
## musaID -0.0002289342 0.10829890 0.076405915 0.028758595 1.0000000000
## college_nn1 -0.0182317589 0.13818948 0.092654911 0.062800802 0.1194380659
## church_nn1 -0.0171479846 0.29263009 0.049713334 0.037213723 0.1542252999
## park_nn2 -0.0096192279 0.15095998 0.043414684 0.026102555 0.2755920966
## transit_nn1 -0.0162114019 0.16177056 0.050270320 0.037905608 0.0339315541
## transit_nn2 -0.0156012509 0.19412943 0.058648113 0.044454281 0.0136889346
## transit_nn3 -0.0172987008 0.21778912 0.064555784 0.047479541 0.0026889617
## transit_nn4 -0.0180069778 0.23231658 0.067353529 0.048549636 0.0085976795
## transit_nn5 -0.0184545888 0.24259211 0.069086938 0.049748763 0.0145862701
## pop 0.0076019876 -0.01807662 -0.011582909 -0.006514217 0.0822508389
## med_hh_income -0.0131669154 0.53445562 0.115551231 0.055263270 0.1802322819
## belpov100 0.0119076119 -0.37829025 -0.079051457 -0.047764182 -0.0620536654
## pctvacant 0.0199782838 -0.10234441 -0.004520817 -0.016163991 -0.1381158792
## pctwhite -0.0020504712 0.43902514 0.145459426 0.069452638 0.1795477574
## pctunemployed 0.0097445317 -0.18238950 -0.042951384 -0.024350076 -0.0673455585
## pctbach -0.0075398136 0.44153918 0.097848737 0.042869097 0.2087682764
## pctgrad -0.0052532860 0.45482504 0.088959048 0.034202585 0.1969485604
## college_nn1 church_nn1 park_nn2 transit_nn1 transit_nn2
## price -0.045648779 0.03907951 0.001701844 -0.01712847 -0.008168584
## sale_year -0.002293618 -0.02561429 -0.008252172 -0.01582709 -0.021121464
## fullbaths 0.127489361 0.25297012 0.143376079 0.13570166 0.161917333
## halfbaths 0.052947081 0.11896218 0.084092197 0.08526689 0.097075909
## bedrooms 0.087449983 0.19523887 0.122577698 0.11886587 0.133685351
## units -0.018231759 -0.01714798 -0.009619228 -0.01621140 -0.015601251
## heatedarea 0.138189476 0.29263009 0.150959982 0.16177056 0.194129427
## shape_Leng 0.092654911 0.04971333 0.043414684 0.05027032 0.058648113
## shape_Area 0.062800802 0.03721372 0.026102555 0.03790561 0.044454281
## musaID 0.119438066 0.15422530 0.275592097 0.03393155 0.013688935
## college_nn1 1.000000000 0.40296112 0.348551657 0.32228523 0.391227215
## church_nn1 0.402961117 1.00000000 0.368271966 0.44614150 0.502067817
## park_nn2 0.348551657 0.36827197 1.000000000 0.31450100 0.326962123
## transit_nn1 0.322285227 0.44614150 0.314500997 1.00000000 0.942931566
## transit_nn2 0.391227215 0.50206782 0.326962123 0.94293157 1.000000000
## transit_nn3 0.437056551 0.55003199 0.338455484 0.88237426 0.976387162
## transit_nn4 0.482845769 0.59051102 0.359278773 0.83596005 0.942056640
## transit_nn5 0.510278926 0.61604949 0.373174007 0.79660854 0.907668470
## pop 0.226750937 0.08218778 0.131024135 0.08929851 0.095035719
## med_hh_income 0.161995791 0.35632855 0.100781382 0.13584099 0.166317178
## belpov100 -0.089592109 -0.25445630 -0.088033206 -0.12074054 -0.140075049
## pctvacant -0.194911544 -0.21794452 -0.223557436 -0.23140926 -0.238346623
## pctwhite 0.029457238 0.24038648 0.023607898 0.06327968 0.062909534
## pctunemployed -0.063076678 -0.10567463 0.010623030 -0.02914095 -0.033436203
## pctbach 0.027434630 0.20545163 -0.006980639 0.06330046 0.061849261
## pctgrad -0.085060136 0.14551632 -0.002126861 -0.03090743 -0.037005764
## transit_nn3 transit_nn4 transit_nn5 pop med_hh_income
## price -0.004303770 -0.004365708 -0.003995815 -0.046395963 0.27484182
## sale_year -0.023131241 -0.024820586 -0.025947369 0.002708378 -0.04607851
## fullbaths 0.181434533 0.194896465 0.204791659 -0.015394071 0.43891142
## halfbaths 0.106947987 0.110088164 0.113094534 0.046484323 0.19498913
## bedrooms 0.147305722 0.157061637 0.164098332 -0.008843498 0.32019202
## units -0.017298701 -0.018006978 -0.018454589 0.007601988 -0.01316692
## heatedarea 0.217789117 0.232316583 0.242592109 -0.018076621 0.53445562
## shape_Leng 0.064555784 0.067353529 0.069086938 -0.011582909 0.11555123
## shape_Area 0.047479541 0.048549636 0.049748763 -0.006514217 0.05526327
## musaID 0.002688962 0.008597679 0.014586270 0.082250839 0.18023228
## college_nn1 0.437056551 0.482845769 0.510278926 0.226750937 0.16199579
## church_nn1 0.550031988 0.590511022 0.616049493 0.082187780 0.35632855
## park_nn2 0.338455484 0.359278773 0.373174007 0.131024135 0.10078138
## transit_nn1 0.882374263 0.835960052 0.796608539 0.089298511 0.13584099
## transit_nn2 0.976387162 0.942056640 0.907668470 0.095035719 0.16631718
## transit_nn3 1.000000000 0.987231698 0.965308251 0.095733820 0.20131727
## transit_nn4 0.987231698 1.000000000 0.993062192 0.097792924 0.22621194
## transit_nn5 0.965308251 0.993062192 1.000000000 0.100241032 0.24530860
## pop 0.095733820 0.097792924 0.100241032 1.000000000 0.01926385
## med_hh_income 0.201317266 0.226211939 0.245308599 0.019263850 1.00000000
## belpov100 -0.159581418 -0.171723114 -0.179943665 0.310517764 -0.59187874
## pctvacant -0.253638609 -0.267612056 -0.278880197 -0.172003727 -0.18708107
## pctwhite 0.075385813 0.082785878 0.089185935 -0.134766423 0.67233000
## pctunemployed -0.044328613 -0.048908783 -0.052369340 0.023023309 -0.33633678
## pctbach 0.075209724 0.086046069 0.094594506 -0.133811823 0.71181624
## pctgrad -0.030320478 -0.025687105 -0.020735949 -0.132581658 0.73091306
## belpov100 pctvacant pctwhite pctunemployed pctbach
## price -0.15408022 0.062512356 0.260629526 -0.096057104 0.256380828
## sale_year 0.03006777 0.012222420 -0.045237704 0.026845805 -0.044373549
## fullbaths -0.31532345 -0.096423900 0.368176612 -0.138544338 0.361702203
## halfbaths -0.13601673 -0.099033912 0.121605121 -0.087456493 0.145187909
## bedrooms -0.25483647 -0.108223633 0.257981171 -0.097317435 0.248258960
## units 0.01190761 0.019978284 -0.002050471 0.009744532 -0.007539814
## heatedarea -0.37829025 -0.102344405 0.439025142 -0.182389501 0.441539185
## shape_Leng -0.07905146 -0.004520817 0.145459426 -0.042951384 0.097848737
## shape_Area -0.04776418 -0.016163991 0.069452638 -0.024350076 0.042869097
## musaID -0.06205367 -0.138115879 0.179547757 -0.067345558 0.208768276
## college_nn1 -0.08959211 -0.194911544 0.029457238 -0.063076678 0.027434630
## church_nn1 -0.25445630 -0.217944517 0.240386479 -0.105674629 0.205451632
## park_nn2 -0.08803321 -0.223557436 0.023607898 0.010623030 -0.006980639
## transit_nn1 -0.12074054 -0.231409258 0.063279682 -0.029140948 0.063300460
## transit_nn2 -0.14007505 -0.238346623 0.062909534 -0.033436203 0.061849261
## transit_nn3 -0.15958142 -0.253638609 0.075385813 -0.044328613 0.075209724
## transit_nn4 -0.17172311 -0.267612056 0.082785878 -0.048908783 0.086046069
## transit_nn5 -0.17994367 -0.278880197 0.089185935 -0.052369340 0.094594506
## pop 0.31051776 -0.172003727 -0.134766423 0.023023309 -0.133811823
## med_hh_income -0.59187874 -0.187081070 0.672329999 -0.336336777 0.711816239
## belpov100 1.00000000 0.151029056 -0.542750934 0.314907583 -0.606753394
## pctvacant 0.15102906 1.000000000 -0.106457096 0.008780449 -0.034536516
## pctwhite -0.54275093 -0.106457096 1.000000000 -0.319582238 0.742082041
## pctunemployed 0.31490758 0.008780449 -0.319582238 1.000000000 -0.422797700
## pctbach -0.60675339 -0.034536516 0.742082041 -0.422797700 1.000000000
## pctgrad -0.56330874 -0.007865324 0.683025033 -0.408132927 0.750232475
## pctgrad
## price 0.293630205
## sale_year -0.041623920
## fullbaths 0.374592925
## halfbaths 0.132508751
## bedrooms 0.259539491
## units -0.005253286
## heatedarea 0.454825044
## shape_Leng 0.088959048
## shape_Area 0.034202585
## musaID 0.196948560
## college_nn1 -0.085060136
## church_nn1 0.145516320
## park_nn2 -0.002126861
## transit_nn1 -0.030907432
## transit_nn2 -0.037005764
## transit_nn3 -0.030320478
## transit_nn4 -0.025687105
## transit_nn5 -0.020735949
## pop -0.132581658
## med_hh_income 0.730913061
## belpov100 -0.563308743
## pctvacant -0.007865324
## pctwhite 0.683025033
## pctunemployed -0.408132927
## pctbach 0.750232475
## pctgrad 1.000000000
At this time, we are going to delete collinear independent(R srquare>0.75 or <-0.75).
We choose the length of shape, distance to nearest 3 transit stops, percent of graduate, and move the shape area, percent bachelor degree and distance to nearest others transit stops out at the same time.
cormaxsel <-
(cormax.df < 0.75 &cormax.df > -0.75 )*cormax.df
cormaxsel2 <-
which(t(t(cormaxsel) ==0), arr.ind=TRUE)
cormaxsel2[,"row"] <- rownames(cormaxsel)[as.numeric(cormaxsel2[,"row"])]
cormaxsel2[,"col"] <- colnames(cormaxsel)[as.numeric(cormaxsel2[,"col"])]
for (i in(1:nrow(cormaxsel2))){
if(cormaxsel2[i,1] != cormaxsel2[i,2]){
print(cormaxsel2[i,])
}
}
## row col
## "heatedarea" "fullbaths"
## row col
## "fullbaths" "heatedarea"
## row col
## "shape_Area" "shape_Leng"
## row col
## "shape_Leng" "shape_Area"
## row col
## "transit_nn2" "transit_nn1"
## row col
## "transit_nn3" "transit_nn1"
## row col
## "transit_nn4" "transit_nn1"
## row col
## "transit_nn5" "transit_nn1"
## row col
## "transit_nn1" "transit_nn2"
## row col
## "transit_nn3" "transit_nn2"
## row col
## "transit_nn4" "transit_nn2"
## row col
## "transit_nn5" "transit_nn2"
## row col
## "transit_nn1" "transit_nn3"
## row col
## "transit_nn2" "transit_nn3"
## row col
## "transit_nn4" "transit_nn3"
## row col
## "transit_nn5" "transit_nn3"
## row col
## "transit_nn1" "transit_nn4"
## row col
## "transit_nn2" "transit_nn4"
## row col
## "transit_nn3" "transit_nn4"
## row col
## "transit_nn5" "transit_nn4"
## row col
## "transit_nn1" "transit_nn5"
## row col
## "transit_nn2" "transit_nn5"
## row col
## "transit_nn3" "transit_nn5"
## row col
## "transit_nn4" "transit_nn5"
## row col
## "pctgrad" "pctbach"
## row col
## "pctbach" "pctgrad"
We chose four factors that we thought were most relevant to house price, but it appears that school has little effect on home prices. But I wonder what will change in the multi-factor model afterwards.
numplot1 <- st_drop_geometry(numvar)
numplot1 %>%
dplyr::select(price, pctgrad, park_nn2, med_hh_income, pctvacant) %>%
filter(price <= 1000000) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .2,alpha = .1, colour = "#2d2d42") +
geom_smooth(method = "lm", se=F, colour = "#dea243") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Price As a Function of Continuous Variables") +
theme_nice()
## `geom_smooth()` using formula 'y ~ x'
numplot2 <- st_drop_geometry(numvar)
numplot2 %>%
dplyr::select(price, transit_nn1, transit_nn2, transit_nn3, transit_nn4, transit_nn5) %>%
filter(price <= 1000000) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .2,alpha = .05, colour = "#2d2d42") +
geom_smooth(method = "lm", se=F, colour = "#dea243") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Price as a function of continuous variables") +
theme(axis.text.x = element_text(size = 5, hjust = 1),
axis.text.y = element_text(size = 2, hjust = 1)) +
theme_nice()
## `geom_smooth()` using formula 'y ~ x'
Below is a mapping of home sales prices in Mecklenburge. The yellow part means high price, and the purple part means low price. So the high price area mainly concentrate in the north and the south of the Mecklenburge.
ggplot() +
geom_sf(data = bd, fill = "grey25", color = "grey50", size = 0.2) +
geom_sf(data = data.sf, aes(colour = q5(price)),
show.legend = "point", size = .1, alpha = .1) +
scale_colour_manual(values = palette5,
labels=qBr(data,"price"),
name="Home Sale Price in US Dollars\n(Quintile Breaks)") +
labs(title="Home Sale Price in Mecklenburge, NC") +
theme(plot.title = element_text(hjust = 0.5)) +
mapTheme()
Below are three mappings of interesting independent variables. Actually, the distribution are kind of similar in spatial area. But the density in some areas are different. The distribution of average of income are concentrated. But the heated area has some high values in the north of the city.
ggplot() +
geom_sf(data = bd, fill = "grey25", color = "grey50", size = 0.2) +
geom_sf(data = modelling, aes(colour = q5(heatedarea)),
show.legend = "points", size = .1, alpha = .1) +
scale_colour_manual(values = palette5,
labels=qBr(modelling,"heatedarea"),
name="Heated Area\n(QuintileBreaks)") +
labs(title="Heated Area Distribution") +
theme(plot.title = element_text(hjust = 0.5)) +
mapTheme()
ggplot() +
geom_sf(data = bd, fill = "grey25", color = "grey50", size = 0.2) +
geom_sf(data = modelling, aes(colour = q5(pop)),
show.legend = "points", size = .1, alpha = .1) +
scale_colour_manual(values = palette5,
labels=qBr(modelling,"pop"),
name="Population\n(QuintileBreaks)") +
labs(title="Population Distribution") +
theme(plot.title = element_text(hjust = 0.5)) +
mapTheme()
ggplot() +
geom_sf(data = bd, fill = "grey25", color = "grey50", size = 0.2) +
geom_sf(data = modelling, aes(colour = q5(med_hh_income)),
show.legend = "points", size = .1, alpha = .1) +
scale_colour_manual(values = palette5,
labels=qBr(modelling,"med_hh_income"),
name="Income in US Dollars\n(QuintileBreaks)") +
labs(title="Average Income of Household Each House") +
theme(plot.title = element_text(hjust = 0.5)) +
mapTheme()
Machine Learning is awesome and powerful, but it can also appear incredibly complicated. The process typically has five stages: namely data wrangling, exploratory analysis, feature engineering, feature selection as well as model estimation and validation.
The first method of evaluating the model: the accuracy. The second method of evaluating models: the generalizability. What is Generalizability? The results are widely applicable to different types of data.
In order to improve the accuracy, that is, to reduce the difference between the predicted and observed house prices, we do some data cleaning, including removing impossible data. In addition, we still need to see the correlation between predictors by using ggcorrplot() to plot a correlation matrix.
We will use the createDataPartition function to randomly
split the current data into 70% of the training and 30% of the test data
set. To train the model, We created a training data set (data set with
known sale prices) which included 45581 observations to “train” the
regression model to predict the home sales price in the test data set
(the data set of homes with unknown, or 0, prices).
Secondly, the errors! We use the MAE, MAPE, of a randomly selected
test set to see how much our predictions differ from the observed home
selling prices, and to see how good the model is. The smaller the MAPE,
the better the model. We useabs()function to calculate the
error, MAE and MAPE, use mean() function to calculate the
average of MAE and MAPE.
Thirdly, In evaluating the predictive ability of our model, We got two separate approaches. The first approach was “In-sample training”, in which I divided the modelling data-set into two groups(M.test and M.train), and used one of the groups to predict the prices in the other. The second approach was a K-fold cross-validation which randomly divids the training set into ten equal “folds”, and one by one predicts prices for each of the folds using the remaining nine folds. K-fold cross-validation is a good way to measure the performance of a model on new data.
Next, Moran’s I is a method to examine whether the model was sufficiently capturing spatial structure of prices. Meanwhile, it can indicate some spatial dynamic that was not accounted for in the mode.
Finding the residuals to be spatially correlated, we use the partitioned data from the census data as the neighborhood data. We added a new spatial feature to the model, hoping to balance the spatial correlation in the residuals. Comparing the MAPE of the model with the two predictions, the MAPE of the new model is reduced. This indicates that this spatial feature has an effect in the model.
Finally, we used generalizability to test the applicability of the model. Using different partitioning methods, we verify that the model can predict accurately. Comparing the change of MAPE values of these two mods, the smaller the change value is, the more applicable the model is.
Thus, we randomly select 70% of the home sale price observations from
the dataset as the training set, and 30% of the dataset as the testing
dataset bycreateDataPartition.
modelling_sub_200k <- modelling %>%
filter(price <= 2000000)
inTrain <- createDataPartition(
y = paste(modelling$NAME,modelling$sale_year,modelling$storyheigh,modelling$fullbaths,modelling$halfbaths,modelling$bedrooms,modelling$units,modelling$heatedarea,modelling$heatedfuel,modelling$foundation,modelling$bldggrade,modelling$year),
p = .70, list = FALSE)
## Warning in createDataPartition(y = paste(modelling$NAME, modelling$sale_year, :
## Some classes have a single record ( Census Tract 1.02, Mecklenburg County, North
## Carolina 2021 1 STORY 1 0 2 1 913 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 1 0 2 1 1034 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 1 0
## 2 1 1092 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 1 0 2 1 868 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 1 0 2 1 972 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 1 0
## 3 1 1040 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 1 0 3 1 1106 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 1 0 3 1 1114 GAS CRAWL SPACE
## GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 1 0
## 3 1 1180 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 1 0 3 1 1248 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 1 0 3 1 1263 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 1 0
## 3 1 1272 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 1 0 3 1 1324 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 1 0 3 1 1352 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 1 0
## 3 1 1446 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 1 0 3 1 1494 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 1 1 2 1 1717 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 1 1
## 2 1 1800 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 2 0 2 1 1170 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 2 0 2 1 1287 GAS CRAWL SPACE
## GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 2 0
## 2 2 1776 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 2 0 3 1 1200 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 2 0 3 1 1248 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 2 0
## 3 1 1359 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 2 0 3 1 1514 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 2 0 3 1 1535 GAS CRAWL SPACE
## GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 2 0
## 3 1 1644 ELECTRIC CRAWL SPACE AVERAGE 2, Census Tract 10, Mecklenburg County,
## North Carolina 2020 1 STORY 2 0 3 1 1711 GAS CRAWL SPACE GOOD 1, Census Tract
## 10, Mecklenburg County, North Carolina 2020 1 STORY 2 0 3 1 1732 GAS CRAWL SPACE
## GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 2 0 3 1
## 1764 GAS CRAWL SPACE GOOD 1, Census Tract 10, Mecklenburg County, North Carolina
## 2020 1 STORY 2 0 3 1 2243 GAS CRAWL SPACE GOOD 1, Census Tract 10, Mecklenburg
## County, North Carolina 2020 1 STORY 2 0 3 1 2300 GAS CRAWL SPACE GOOD 1, Census
## Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 2 0 3 1 2410 GAS CRAWL
## SPACE GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY
## 3 0 3 1 1530 GAS BASEMENT AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 3 0 3 1 1839 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 3 0 3 1 2155 GAS CRAWL SPACE
## GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1.5 STORY 1 0
## 2 1 1251 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1.5 STORY 2 0 3 1 1478 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1.5 STORY 2 0 3 1 1812 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1.5 STORY
## 2 0 3 1 2416 GAS CRAWL SPACE GOOD 2, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1.5 STORY 2 0 4 1 1665 GAS CRAWL SPACE GOOD 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1.5 STORY 3 0 4 1 1975 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1.5 STORY 3
## 0 4 1 1984 GAS CRAWL SPACE AVERAGE 2, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1.5 STORY 3 0 4 1 2318 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1.5 STORY 3 1 3 1 2573 GAS CRAWL SPACE
## GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1.5 STORY 4
## 0 4 1 2724 GAS CRAWL SPACE GOOD 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 2.0 STORY 2 0 3 1 1504 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 2.0 STORY 2 0 3 1 1940 GAS CRAWL SPACE
## GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 2.0 STORY 2 0
## 3 1 2026 GAS CRAWL SPACE AVERAGE 2, Census Tract 10, Mecklenburg County, North
## Carolina 2020 2.0 STORY 2 1 3 1 1856 GAS CRAWL SPACE GOOD 3, Census Tract 10,
## Mecklenburg County, North Carolina 2020 2.0 STORY 2 1 3 1 1928 GAS CRAWL SPACE
## GOOD 3, Census Tract 10, Mecklenburg County, North Carolina 2020 2.0 STORY 2
## 1 3 1 2311 GAS CRAWL SPACE GOOD 3, Census Tract 10, Mecklenburg County, North
## Carolina 2020 2.0 STORY 2 1 3 1 2329 GAS CRAWL SPACE GOOD 4, Census Tract 10,
## Mecklenburg County, North Carolina 2020 2.0 STORY 2 1 3 1 2418 GAS CRAWL SPACE
## GOOD 3, Census Tract 10, Mecklenburg County, North Carolina 2020 2.0 STORY
## 2 1 3 1 2436 GAS BASEMENT GOOD 3, Census Tract 10, Mecklenburg County, North
## Carolina 2020 2.0 STORY 2 1 3 1 2742 GAS CRAWL SPACE AVERAGE 1, Census Tract
## 10, Mecklenburg County, North Carolina 2020 2.0 STORY 2 1 4 1 2040 GAS CRAWL
## SPACE GOOD 4, Census Tract 10, Mecklenburg County, North Carolina 2020 2.0 STORY
## 2 1 4 1 2192 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County,
## North Carolina 2020 2.0 STORY 3 0 3 1 2042 GAS SLAB-RES GOOD 4, Census Tract 10,
## Mecklenburg County, North Carolina 2020 2.0 STORY 3 0 3 1 2690 GAS CRAWL SPACE
## VERY GOOD 4, Census Tract 10, Mecklenburg County, North Carolina 2020 2.0 STORY
## 3 0 3 1 2717 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County,
## North Carolina 2020 2.0 STORY 3 0 4 1 2496 GAS CRAWL SPACE GOOD 3, Census Tract
## 10, Mecklenburg County, North Carolina 2020 2.0 STORY 3 0 4 1 2708 GAS BASEMENT
## VERY GOOD 4, Census Tract 10, Mecklenburg County, North Carolina 2020 2.0 STORY
## 3 0 4 1 3210 GAS CRAWL SPACE GOOD 3, Census Tract 10, Mecklenburg County, North
## Carolina 2020 2.0 STORY 3 1 4 1 2164 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 2.0 STORY 3 1 4 1 4002 GAS CRAWL SPACE
## VERY GOOD 4, Census Tract 10, Mecklenburg County, North Carolina 2020 2.0 STORY
## 4 0 4 1 4034 GAS CRAWL SPACE VERY GOOD 4, Census Tract 10, Mecklenburg County,
## North Carolina 2020 2.0 STORY 4 1 4 1 3844 GAS CRAWL SPACE GOOD 3, Census Tract
## 10, Mecklenburg County, North Carolina 2020 2.0 STORY 4 1 50 3 4016 GAS CRAWL
## SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 2.5
## STORY 2 0 3 1 897 GAS CRAWL SPACE GOOD 1, Census Tract 10, Mecklenburg County,
## North Carolina 2020 2.5 STORY 2 1 3 1 1930 GAS BASEMENT AVERAGE 2, Census Tract
## 10, Mecklenburg County, North Carolina 2020 2.5 STORY 3 1 3 1 2082 GAS CRAWL
## SPACE GOOD 3, Census Tract 10, Mecklenburg County, North Carolina 2020 2.5 STORY
## 3 1 4 1 2528 GAS CRAWL SPACE GOOD 3, Census Tract 10, Mecklenburg County, North
## Carolina 2020 2.5 STORY 3 1 4 1 2554 GAS SLAB-RES VERY GOOD 4, Census Tract 10,
## Mecklenburg County, North Carolina 2020 2.5 STORY 3 1 4 1 3205 GAS CRAWL SPACE
## GOOD 4, Census Tract 10, Mecklenburg County, North Carolina 2020 2.5 STORY 3 1
## 5 1 3690 GAS CRAWL SPACE VERY GOOD 4, Census Tract 10, Mecklenburg County, North
## Carolina 2020 RANCH W/BSMT 1 0 2 1 1188 GAS BASEMENT AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 RANCH W/BSMT 2 0 3 1 1
M.train.sf <- modelling_sub_200k[inTrain,]
M.test.sf <- modelling_sub_200k[-inTrain,]
M.train <- st_drop_geometry(M.train.sf)
M.test <- st_drop_geometry(M.test.sf)
reg1 <- lm(price ~ ., data = M.train %>%
dplyr::select(price,sale_year,storyheigh,fullbaths, halfbaths,bedrooms,units,heatedarea,heatedfuel,foundation,bldggrade,shape_Leng,year,college_nn1,church_nn1,park_nn2,transit_nn3,pop,med_hh_income,belpov100,pctvacant,pctunemployed,pctgrad))
stargazer(reg1, type = "text",title="Regression Results",align=TRUE,no.space=TRUE)
##
## Regression Results
## ==========================================================
## Dependent variable:
## -----------------------------
## price
## ----------------------------------------------------------
## sale_year 72,727.750***
## (801.917)
## storyheigh1 STORY 39,632.020
## (73,299.900)
## storyheigh1.5 STORY 53,759.030
## (73,315.870)
## storyheigh2.0 STORY 25,202.240
## (73,285.760)
## storyheigh2.5 STORY -306.798
## (73,372.490)
## storyheigh3.0 STORY -36,354.800
## (74,874.900)
## storyheighBI-LEVEL 4,891.525
## (73,951.700)
## storyheighCAPE COD 48,841.770
## (78,865.910)
## storyheighRANCH W/BSMT 20,904.500
## (73,544.120)
## storyheighSPLIT LEVEL 1,852.319
## (73,422.810)
## fullbaths 41,591.200***
## (1,379.076)
## halfbaths 21,505.890***
## (1,551.350)
## bedrooms -7,214.694***
## (848.521)
## units 22,438.570***
## (3,294.094)
## heatedarea 87.409***
## (1.299)
## heatedfuelGAS 10,552.080***
## (1,890.182)
## heatedfuelNONE -55,814.460***
## (15,729.220)
## heatedfuelOIL/WD/COAL -75,751.410***
## (8,445.143)
## heatedfuelSOLAR/GEOTHRM 13,537.940
## (42,344.900)
## foundationCRAWL SPACE 16,577.070***
## (4,726.464)
## foundationPIER-NO FOUND WALL 233,805.300***
## (56,921.550)
## foundationSLAB-ABV GRD 46,763.020**
## (19,306.840)
## foundationSLAB-COM -29,284.830
## (56,917.750)
## foundationSLAB-RES 6,888.206
## (4,847.263)
## bldggradeCUSTOM 726,381.600***
## (14,195.270)
## bldggradeEXCELLENT 575,308.400***
## (5,507.844)
## bldggradeFAIR -60,136.340***
## (5,104.902)
## bldggradeGOOD 84,045.900***
## (1,756.326)
## bldggradeMINIMUM -88,184.070***
## (28,496.280)
## bldggradeVERY GOOD 265,346.600***
## (2,881.778)
## shape_Leng 68.910***
## (2.881)
## year1 879,996.100***
## (126,920.200)
## year2 824,306.400***
## (126,914.900)
## year3 829,367.800***
## (126,912.100)
## year4 837,279.800***
## (126,902.000)
## college_nn1 -2.414***
## (0.082)
## church_nn1 -4.628***
## (0.377)
## park_nn2 1.056***
## (0.268)
## transit_nn3 -2.115***
## (0.560)
## pop -3.748***
## (0.530)
## med_hh_income 0.797***
## (0.029)
## belpov100 33.513***
## (2.460)
## pctvacant 548,400.100***
## (15,452.610)
## pctunemployed 14,073.490
## (17,237.010)
## pctgrad 616,774.500***
## (16,319.520)
## Constant -147,928,148.000***
## (1,627,206.000)
## ----------------------------------------------------------
## Observations 45,138
## R2 0.770
## Adjusted R2 0.770
## Residual Std. Error 126,743.500 (df = 45092)
## F Statistic 3,360.218*** (df = 45; 45092)
## ==========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Floorstoryheight was removed to reduce errors and improve the model, and 23 predictor variables were included in the regression model training, including 10 categorical variables.
reg2 <- lm(price ~ ., data = M.train %>%
dplyr::select(price,sale_year,fullbaths, halfbaths,bedrooms,units,heatedarea,heatedfuel,foundation,bldggrade,shape_Leng,year,college_nn1,church_nn1,park_nn2,transit_nn3,pop,med_hh_income,belpov100,pctvacant,pctunemployed,pctgrad))
stargazer(reg2,
type = "text",
title="Regression Results",
covariate.labels = c("Sale Year", "Number of Fullbaths", "Number of Halfbaths", "Number of Bedrooms", "Number of Units", "Heated Area", "Heated Fuel: GAS", "Heated Fuel: None", "Heated Fuel: OIL or WD or COAL", "Heated Fuel: SOLAR or GEOTHRM", "Foundation: CRAWL SPACE", "Foundation PIER-NO FOUND WALL", "Foundation SLAB-ABV GRD", "Foundation SLAB-COM", "Foundation SLAB-RES", "Building Grade: Custom", "Building Grade: EXCELLENT", "Building Grade: FAIR", "Building Grade: GOOD", "Building Grade: MINIMUM", "Building Grade: VERY GOOD", "Lengh of Shape", "Before 1979", "Between 1979 and 1999", "Between 1999 and 2015", "Between 2015 and 2022", "Average Distance Between Each House and 1 Nearest College", "Average Distance Between Each House 1 Nearest Church", "Average Distance Between Each House 2 Nearest Parks", "Average Distance Between Each House 3 Nearest Transit Stops", "Total Population 2020", "Median Household Income", "Number of Below 100% of the Poverty Level", "Percent of Vacant Housing Units","Percent of White People", "Percent of Unemployed Labor Force", "Percent of Bachelor's Degree", "Percent of Graduate"))
##
## Regression Results
## =========================================================================================
## Dependent variable:
## -----------------------------
## price
## -----------------------------------------------------------------------------------------
## Sale Year 72,638.790***
## (804.272)
##
## Number of Fullbaths 39,440.120***
## (1,362.375)
##
## Number of Halfbaths 15,231.520***
## (1,376.038)
##
## Number of Bedrooms -8,642.240***
## (842.505)
##
## Number of Units 22,819.110***
## (3,302.940)
##
## Heated Area 85.443***
## (1.277)
##
## Heated Fuel: GAS 11,444.350***
## (1,891.558)
##
## Heated Fuel: None -56,246.750***
## (15,772.990)
##
## Heated Fuel: OIL or WD or COAL -75,370.100***
## (8,465.220)
##
## Heated Fuel: SOLAR or GEOTHRM 14,074.410
## (42,474.460)
##
## Foundation: CRAWL SPACE 17,020.190***
## (4,645.492)
##
## Foundation PIER-NO FOUND WALL 236,991.500***
## (57,089.400)
##
## Foundation SLAB-ABV GRD 52,496.030***
## (19,338.010)
##
## Foundation SLAB-COM -30,645.940
## (57,084.520)
##
## Foundation SLAB-RES 8,001.614*
## (4,764.142)
##
## Building Grade: Custom 731,030.600***
## (14,209.490)
##
## Building Grade: EXCELLENT 577,875.600***
## (5,480.953)
##
## Building Grade: FAIR -60,348.220***
## (5,113.535)
##
## Building Grade: GOOD 85,761.580***
## (1,758.405)
##
## Building Grade: MINIMUM -88,463.310***
## (28,581.120)
##
## Building Grade: VERY GOOD 267,551.100***
## (2,875.308)
##
## Lengh of Shape 72.398***
## (2.872)
##
## Before 1979 878,177.300***
## (127,307.500)
##
## Between 1979 and 1999 822,189.000***
## (127,304.900)
##
## Between 1999 and 2015 826,258.500***
## (127,302.900)
##
## Between 2015 and 2022 835,121.200***
## (127,292.600)
##
## Average Distance Between Each House and 1 Nearest College -2.421***
## (0.082)
##
## Average Distance Between Each House 1 Nearest Church -4.519***
## (0.378)
##
## Average Distance Between Each House 2 Nearest Parks 1.066***
## (0.269)
##
## Average Distance Between Each House 3 Nearest Transit Stops -2.001***
## (0.561)
##
## Total Population 2020 -3.649***
## (0.531)
##
## Median Household Income 0.813***
## (0.029)
##
## Number of Below 100% of the Poverty Level 32.933***
## (2.465)
##
## Percent of Vacant Housing Units 556,626.200***
## (15,484.520)
##
## Percent of White People 11,668.010
## (17,285.210)
##
## Percent of Unemployed Labor Force 617,399.100***
## (16,348.200)
##
## Percent of Bachelor's Degree -147,704,199.000***
## (1,630,538.000)
##
## -----------------------------------------------------------------------------------------
## Observations 45,138
## R2 0.769
## Adjusted R2 0.769
## Residual Std. Error 127,137.400 (df = 45101)
## F Statistic 4,166.285*** (df = 36; 45101)
## =========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
alldata.sf <-
alldata.sf %>%
mutate(Price.Predict = predict(reg2, alldata.sf))
M.test.sf <-
M.test.sf %>%
mutate(Price.Predict = predict(reg2, M.test.sf),
Price.Error = Price.Predict - price,
Price.AbsError = abs(Price.Predict - price),
Price.APE = (abs(Price.Predict - price)) / Price.Predict)
Test_MAE <- mean(M.test.sf$Price.AbsError, na.rm = T)
Test_MAPE <- mean(M.test.sf$Price.APE, na.rm = T)
M.train.sf <-
M.train.sf %>%
mutate(Price.Predict = predict(reg2, M.train.sf),
Price.Error = Price.Predict - price,
Price.AbsError = abs(Price.Predict - price),
Price.APE = (abs(Price.Predict - price)) / Price.Predict)
Training_MAE <- mean(M.train.sf$Price.AbsError, na.rm = T)
Training_MAPE <- mean(M.train.sf$Price.APE, na.rm = T)
error <- data.frame(Training_MAE, Training_MAPE, Test_MAE, Test_MAPE)
error %>%
kable() %>%
kable_material(bootstrap_options = "responsive",font_size = 20, full_width = F)
| Training_MAE | Training_MAPE | Test_MAE | Test_MAPE |
|---|---|---|---|
| 80799.39 | 0.1909767 | 58297.31 | 0.1963437 |
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(price ~ ., data = M.test %>%
dplyr::select(price,sale_year,storyheigh,fullbaths,halfbaths,bedrooms,units,heatedarea,heatedfuel,foundation,bldggrade,shape_Leng,year,college_nn1,church_nn1,park_nn2,transit_nn3),
method = "lm", trControl = fitControl, na.action = na.pass)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
reg.cv
## Linear Regression
##
## 426 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 422, 420, 422, 422, 423, 421, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 76670.93 0.7252262 57038.27
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
We part the data frame into 100 equal sized subsets and train on a subset of observations, predict on a test set, and measure goodness of fit. The standard diviation of the MAE are different in different folders. This means this model can not predict precisely in different folds.
reg.cv$resample[,]
## RMSE Rsquared MAE Resample
## 1 31720.19 0.716823881 23105.50 Fold001
## 2 95792.81 0.743850800 54471.97 Fold002
## 3 127132.11 0.725732912 113461.32 Fold003
## 4 28787.48 0.874503268 19252.47 Fold004
## 5 73002.65 0.920649278 60763.94 Fold005
## 6 29262.21 0.970050964 28883.55 Fold006
## 7 44318.88 0.138118822 35035.99 Fold007
## 8 28770.43 0.981474909 19497.86 Fold008
## 9 22001.01 0.823775149 19530.12 Fold009
## 10 93640.96 0.488169848 64640.38 Fold010
## 11 33455.20 0.689185678 26418.09 Fold011
## 12 38676.69 0.840749695 34065.85 Fold012
## 13 62131.37 0.837363384 51779.44 Fold013
## 14 107958.31 0.995160302 83562.77 Fold014
## 15 48022.29 0.540691472 46616.15 Fold015
## 16 205094.06 0.284738986 159478.98 Fold016
## 17 61822.83 0.703318526 41097.63 Fold017
## 18 98434.28 0.665433215 75094.06 Fold018
## 19 55140.45 0.498461038 34142.00 Fold019
## 20 100642.36 0.356964240 66682.26 Fold020
## 21 101753.72 0.999902172 77671.08 Fold021
## 22 98852.29 0.109443735 74915.34 Fold022
## 23 19400.34 0.980397786 19090.28 Fold023
## 24 100012.95 0.892138759 91723.82 Fold024
## 25 89692.01 0.974790826 63597.82 Fold025
## 26 64765.68 0.988495524 50776.06 Fold026
## 27 48615.56 0.644452128 38565.37 Fold027
## 28 37518.26 0.605148622 32775.49 Fold028
## 29 73561.57 0.968397426 53339.38 Fold029
## 30 91838.30 0.748856374 74768.08 Fold030
## 31 41349.85 0.969699377 32033.48 Fold031
## 32 104880.99 0.700745397 92866.77 Fold032
## 33 21207.95 0.964024820 16251.18 Fold033
## 34 56773.13 0.779004672 41976.51 Fold034
## 35 25238.45 0.866156306 21425.14 Fold035
## 36 168718.81 0.442833638 88722.54 Fold036
## 37 108513.25 0.928675506 81245.64 Fold037
## 38 33169.10 0.888432743 30910.43 Fold038
## 39 109312.36 0.574421411 77052.24 Fold039
## 40 89539.39 0.881172661 62088.45 Fold040
## 41 77211.91 0.890906580 52182.29 Fold041
## 42 32344.26 0.855199695 23991.48 Fold042
## 43 71919.38 0.951358655 66939.96 Fold043
## 44 54247.56 0.777301933 49663.94 Fold044
## 45 409923.32 0.934051062 227944.56 Fold045
## 46 35642.81 0.821403618 27543.54 Fold046
## 47 62871.52 0.323974709 57835.20 Fold047
## 48 35350.09 0.751281273 31077.08 Fold048
## 49 28940.82 0.840497656 25574.44 Fold049
## 50 125084.48 0.228620456 87509.80 Fold050
## 51 284118.37 0.225522061 216848.75 Fold051
## 52 175238.79 0.878704233 115726.52 Fold052
## 53 69645.45 0.034468479 58395.07 Fold053
## 54 51020.04 0.001612873 41583.90 Fold054
## 55 64249.77 0.703966598 55939.78 Fold055
## 56 41659.34 0.695942401 32757.02 Fold056
## 57 72368.45 0.503139869 57824.42 Fold057
## 58 32692.77 0.993588310 32272.98 Fold058
## 59 98379.15 0.982637063 71480.38 Fold059
## 60 28822.79 0.919063474 25337.20 Fold060
## 61 53744.32 0.864519196 38546.61 Fold061
## 62 64499.41 0.541037003 53211.92 Fold062
## 63 122451.85 0.002424432 94964.95 Fold063
## 64 21090.28 0.951791624 17969.37 Fold064
## 65 122510.85 0.098925991 72841.74 Fold065
## 66 35206.03 0.977037044 31282.31 Fold066
## 67 47405.77 0.900236607 39307.44 Fold067
## 68 76795.02 0.872472777 39933.90 Fold068
## 69 79177.79 0.985076003 49628.46 Fold069
## 70 71261.05 0.895193741 66422.25 Fold070
## 71 61490.26 0.878665107 58085.18 Fold071
## 72 34906.25 0.737038351 34676.91 Fold072
## 73 52900.43 0.903767098 43132.32 Fold073
## 74 97420.64 0.918958611 77129.94 Fold074
## 75 60123.46 0.013702482 52789.87 Fold075
## 76 38148.42 0.750728276 27896.18 Fold076
## 77 90195.57 0.018334994 80312.14 Fold077
## 78 43584.72 0.708801573 34791.87 Fold078
## 79 39910.75 0.623035790 31507.49 Fold079
## 80 28493.35 0.993220935 25206.21 Fold080
## 81 56170.69 0.561999105 37615.75 Fold081
## 82 385555.23 0.857323796 224856.52 Fold082
## 83 15025.70 0.996447845 14626.80 Fold083
## 84 45458.26 0.724718146 36277.23 Fold084
## 85 169200.94 0.759597897 140398.95 Fold085
## 86 47522.61 0.961810708 44525.66 Fold086
## 87 25009.11 0.836666296 19528.03 Fold087
## 88 98930.61 0.939890298 82404.34 Fold088
## 89 58191.38 0.708175848 40328.77 Fold089
## 90 29562.65 0.938892545 24319.10 Fold090
## 91 105673.10 0.300797239 61846.65 Fold091
## 92 26862.76 0.891887358 24848.83 Fold092
## 93 84145.24 0.553005571 56822.63 Fold093
## 94 78459.06 0.793118424 71164.53 Fold094
## 95 39206.20 0.795314712 30732.83 Fold095
## 96 39308.78 0.853618091 31129.53 Fold096
## 97 182175.24 0.951499697 98849.42 Fold097
## 98 105677.49 0.716735193 69069.90 Fold098
## 99 72009.30 0.880181377 51172.13 Fold099
## 100 39381.13 0.850328852 35868.41 Fold100
mean(reg.cv$resample[,3])
## [1] 57038.27
sd(reg.cv$resample[,3])
## [1] 39941.01
The distribution of MAE is not aggregated enough, which there are still many scattered distributions. We consider that the reasons for this distribution may be: 1. The presence of some extreme values 2. The presence of spatial correlation.
hist(reg.cv$resample[,3],
breaks = 50,
col = "#d5d5e6",
border = "#6A6ACD",
xlab = "Mean Absolute Error",
ylab = "Count",
main = "Distribution of MAE\nK fold cross validation;k=100"
)
ggplot(M.test.sf, aes(x = Price.AbsError)) +
geom_histogram(binwidth = 5000, color = "#6A6ACD", fill = "#d5d5e6", size = 0.5) +
scale_x_continuous(name = "Sale Price Absolute Error",
breaks = seq(0, 800000, 100000),
limits=c(0, 800000)) +
scale_y_continuous(name = "Count") +
ggtitle("Distribution of Price Prediciton Error") +
theme_light()
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
ggplot() +
geom_sf(data = bd, fill = "grey90", color = "white", size = 0.3) +
geom_sf(data = M.test.sf, aes(colour = q5(price)),
show.legend = "Price.Error", size = 1, alpha = .5) +
scale_colour_manual(values = palette5,
labels=qBr(M.test.sf,"price"),
name="Predicted Sale Price\nin US Dollars\n(Quintile Breaks)") +
labs(title="Test Set Predicted Sale Prices") +
theme(plot.title = element_text(hjust = 0.5,size = 20)) +
mapTheme()
The MAE of sale price illustrates, prices clustering at different spatial scales.
ggplot() +
geom_sf(data = bd, fill = "grey90", color = "white", size = 0.3) +
geom_sf(data = M.test.sf, aes(colour = q5(Price.Error)),
show.legend = "Price.Error", size = 1, alpha = .5) +
scale_colour_manual(values = palette5,
labels=qBr(M.test.sf,"Price.Error"),
name="The MAE of Sale Price\n(Quintile Breaks)") +
labs(title="Test Set Absolute Sale Price Errors") +
theme(plot.title = element_text(hjust = 0.5,size = 15))+
mapTheme()
Even if a regression perfectly accounts for internal characteristics,
public services and amenities, it may still have errors that cluster in
space. spatial autocorrelation is another essential thing we need to
consider. Thus, we calculate the neighborhood and create its nearest
neighbor by knn2nb.
modelling_sub_200k.sf <- modelling %>%
filter(price <= 2000000)
coords <- st_coordinates(modelling_sub_200k.sf)
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
modelling_sub_200k.sf$lagPrice<-
lag.listw(spatialWeights,modelling_sub_200k.sf$price)
The correlation for the spatial lag relationship is 0.83 and is highly statistically significant, indicating clustering of house prices.
As prices rise, the prices of nearby houses also rise.
ggscatter(modelling_sub_200k.sf, x = "lagPrice", y = "price",
add = "reg.line",
conf.int = TRUE,
add.params = list(fill = "lightgray", color = "#dea243"),
size = .4, alpha = .1, color = "#2d2d42") +
stat_cor(method = "pearson",
label.x = 3, label.y = 30) +
labs(title = "Price as a function of the spatial lag") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
coords.test <- st_coordinates(M.test.sf)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
M.test.sf$lagPriceError<-
lag.listw(spatialWeights.test,M.test.sf$Price.Error)
The relationship describes a significant correlation of Error as a function of the spatial lag of price. The interpretation is that as home price errors increase, so does nearby home price errors. That model errors are spatially autocorrelated suggests that critical spatial information has been omitted from the model.
st_drop_geometry(M.test.sf)%>%
ggplot(aes(lagPriceError,Price.Error)) +
geom_point(size = 1,alpha = .4, color = "#2d2d42") +
geom_smooth(method = "lm", se=F, color = "#dea243") +
labs(title = "Error as a function of the spatial lag of error") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
Reg_Dataframe <- cbind(reg2$residuals,reg2$fitted.values)
Reg_Dataframe <- as.data.frame(Reg_Dataframe)
colnames(Reg_Dataframe) <- c("residuals", "predictedValues")
ggplot(reg2, aes(Reg_Dataframe$residuals)) +
geom_histogram(bins=100, color = "#6A6ACD", fill = "#d5d5e6", size = 0.5) +
labs(x="Residuals",
y="Count")+
theme_light()
ggplot(data = Reg_Dataframe, aes(x = residuals , y = predictedValues)) +
geom_point(size = .3,alpha = .4, color = "#2d2d42") +
xlab("Residuals") +
ylab("Predicted Values") +
ggtitle("Residual Values vs. Predicted Values") +
theme(plot.title = element_text(hjust = 0.5)) +
theme_minimal()
ggplot() +
geom_sf(data = bd, fill = "grey90", color = "white", size = 0.4) +
geom_sf(data = M.test.sf, aes(colour = q5(Price.Error)),
show.legend = "Price.Error", size = .9, alpha = .6) +
scale_colour_manual(values = palette5,
labels=qBr(M.test.sf,"Price.Error"),
name="The Residuals Error\n(Quintile Breaks)") +
labs(title="Residuals Distribution in Test Set") +
theme(plot.title = element_text(hjust = 0.5,size = 15))+
mapTheme()
ggplot() +
geom_sf(data = bd.sf, fill = "grey90", color = "white", size = 0.4) +
geom_sf(data = M.test.sf, aes(colour = q5(lagPriceError)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(M.test.sf,"lagPriceError"),
name="Quintile\nBreaks") +
labs(title="a map of spatial lag in errors") +
theme(plot.title = element_text(hjust = 0.5)) +
mapTheme()
The Moran’s I test results confirm our interpretation of the map. The Clustered point process yields a middling I of 0.31(Moran’s I value). But a p-value of 0.001 suggests that the observed point process is more clustered than all 999 random permutations (1 / 999 = 0.001) and is statistically significant.
moranTest <- moran.mc(M.test.sf$Price.Error,
spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01,fill = "#d5d5e6") +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#dea243",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in Yellow",
x="Moran's I",
y="Count") +
theme_minimal()
## Warning: Removed 2 rows containing missing values (geom_bar).
In the previous analysis, it was concluded that the residuals have a strong spatial correlation. So we added a neighborhood indicator,hoping that the model would fit better.
But after our attempts to use the neiborhood boundary data failed, we switched to using the Name data from census data as an alternative.
left_join(
st_drop_geometry(M.test.sf) %>%
group_by(NAME) %>%
summarize(meanPrice = mean(price, na.rm = T)),
mutate(M.test, predict.fe =
predict(lm(price ~ NAME, data = M.test.sf),
M.test)) %>%
st_drop_geometry %>%
group_by(NAME) %>%
summarize(meanPrediction = mean(predict.fe))) %>%
kable() %>% kable_minimal()
## Joining, by = "NAME"
| NAME | meanPrice | meanPrediction |
|---|---|---|
| Census Tract 14, Mecklenburg County, North Carolina | 273000.0 | 273000.0 |
| Census Tract 15.04, Mecklenburg County, North Carolina | 357000.0 | 357000.0 |
| Census Tract 15.05, Mecklenburg County, North Carolina | 271000.0 | 271000.0 |
| Census Tract 15.08, Mecklenburg County, North Carolina | 205000.0 | 205000.0 |
| Census Tract 16.08, Mecklenburg County, North Carolina | 305000.0 | 305000.0 |
| Census Tract 19.10, Mecklenburg County, North Carolina | 220000.0 | 220000.0 |
| Census Tract 19.22, Mecklenburg County, North Carolina | 239437.5 | 239437.5 |
| Census Tract 19.23, Mecklenburg County, North Carolina | 231333.3 | 231333.3 |
| Census Tract 30.06, Mecklenburg County, North Carolina | 615375.0 | 615375.0 |
| Census Tract 30.08, Mecklenburg County, North Carolina | 390500.0 | 390500.0 |
| Census Tract 30.12, Mecklenburg County, North Carolina | 1025000.0 | 1025000.0 |
| Census Tract 30.19, Mecklenburg County, North Carolina | 545900.0 | 545900.0 |
| Census Tract 31.05, Mecklenburg County, North Carolina | 505000.0 | 505000.0 |
| Census Tract 31.10, Mecklenburg County, North Carolina | 378000.0 | 378000.0 |
| Census Tract 33.02, Mecklenburg County, North Carolina | 963500.0 | 963500.0 |
| Census Tract 36, Mecklenburg County, North Carolina | 437000.0 | 437000.0 |
| Census Tract 38.02, Mecklenburg County, North Carolina | 290000.0 | 290000.0 |
| Census Tract 40, Mecklenburg County, North Carolina | 200500.0 | 200500.0 |
| Census Tract 41.01, Mecklenburg County, North Carolina | 585000.0 | 585000.0 |
| Census Tract 43.02, Mecklenburg County, North Carolina | 283500.0 | 283500.0 |
| Census Tract 43.03, Mecklenburg County, North Carolina | 195000.0 | 195000.0 |
| Census Tract 43.07, Mecklenburg County, North Carolina | 375000.0 | 375000.0 |
| Census Tract 48, Mecklenburg County, North Carolina | 350000.0 | 350000.0 |
| Census Tract 53.05, Mecklenburg County, North Carolina | 103000.0 | 103000.0 |
| Census Tract 53.06, Mecklenburg County, North Carolina | 247000.0 | 247000.0 |
| Census Tract 53.07, Mecklenburg County, North Carolina | 207500.0 | 207500.0 |
| Census Tract 54.05, Mecklenburg County, North Carolina | 226392.9 | 226392.9 |
| Census Tract 54.06, Mecklenburg County, North Carolina | 245750.0 | 245750.0 |
| Census Tract 55.19, Mecklenburg County, North Carolina | 385000.0 | 385000.0 |
| Census Tract 55.20, Mecklenburg County, North Carolina | 286000.0 | 286000.0 |
| Census Tract 55.29, Mecklenburg County, North Carolina | 338000.0 | 338000.0 |
| Census Tract 55.30, Mecklenburg County, North Carolina | 373333.3 | 373333.3 |
| Census Tract 55.31, Mecklenburg County, North Carolina | 385625.0 | 385625.0 |
| Census Tract 55.35, Mecklenburg County, North Carolina | 260000.0 | 260000.0 |
| Census Tract 56.15, Mecklenburg County, North Carolina | 361375.0 | 361375.0 |
| Census Tract 56.17, Mecklenburg County, North Carolina | 269088.2 | 269088.2 |
| Census Tract 56.18, Mecklenburg County, North Carolina | 436625.0 | 436625.0 |
| Census Tract 56.19, Mecklenburg County, North Carolina | 329583.3 | 329583.3 |
| Census Tract 56.21, Mecklenburg County, North Carolina | 369974.4 | 369974.4 |
| Census Tract 56.26, Mecklenburg County, North Carolina | 324121.6 | 324121.6 |
| Census Tract 56.27, Mecklenburg County, North Carolina | 255944.4 | 255944.4 |
| Census Tract 57.10, Mecklenburg County, North Carolina | 204000.0 | 204000.0 |
| Census Tract 57.14, Mecklenburg County, North Carolina | 504416.7 | 504416.7 |
| Census Tract 57.16, Mecklenburg County, North Carolina | 298083.3 | 298083.3 |
| Census Tract 57.19, Mecklenburg County, North Carolina | 349000.0 | 349000.0 |
| Census Tract 57.21, Mecklenburg County, North Carolina | 384400.0 | 384400.0 |
| Census Tract 57.22, Mecklenburg County, North Carolina | 358000.0 | 358000.0 |
| Census Tract 58.26, Mecklenburg County, North Carolina | 236500.0 | 236500.0 |
| Census Tract 58.30, Mecklenburg County, North Carolina | 308625.0 | 308625.0 |
| Census Tract 58.50, Mecklenburg County, North Carolina | 383000.0 | 383000.0 |
| Census Tract 58.53, Mecklenburg County, North Carolina | 357500.0 | 357500.0 |
| Census Tract 58.55, Mecklenburg County, North Carolina | 529000.0 | 529000.0 |
| Census Tract 59.13, Mecklenburg County, North Carolina | 402000.0 | 402000.0 |
| Census Tract 59.18, Mecklenburg County, North Carolina | 315277.8 | 315277.8 |
| Census Tract 59.19, Mecklenburg County, North Carolina | 393555.6 | 393555.6 |
| Census Tract 59.20, Mecklenburg County, North Carolina | 810400.0 | 810400.0 |
| Census Tract 59.21, Mecklenburg County, North Carolina | 441000.0 | 441000.0 |
| Census Tract 59.22, Mecklenburg County, North Carolina | 415000.0 | 415000.0 |
| Census Tract 59.24, Mecklenburg County, North Carolina | 323000.0 | 323000.0 |
| Census Tract 59.25, Mecklenburg County, North Carolina | 309500.0 | 309500.0 |
| Census Tract 59.26, Mecklenburg County, North Carolina | 462666.7 | 462666.7 |
| Census Tract 60.05, Mecklenburg County, North Carolina | 352357.1 | 352357.1 |
| Census Tract 60.08, Mecklenburg County, North Carolina | 280333.3 | 280333.3 |
| Census Tract 60.09, Mecklenburg County, North Carolina | 302750.0 | 302750.0 |
| Census Tract 60.11, Mecklenburg County, North Carolina | 237857.1 | 237857.1 |
| Census Tract 60.12, Mecklenburg County, North Carolina | 264500.0 | 264500.0 |
| Census Tract 60.13, Mecklenburg County, North Carolina | 250500.0 | 250500.0 |
| Census Tract 60.14, Mecklenburg County, North Carolina | 377000.0 | 377000.0 |
| Census Tract 60.15, Mecklenburg County, North Carolina | 244333.3 | 244333.3 |
| Census Tract 60.16, Mecklenburg County, North Carolina | 252250.0 | 252250.0 |
| Census Tract 61.03, Mecklenburg County, North Carolina | 708500.0 | 708500.0 |
| Census Tract 61.08, Mecklenburg County, North Carolina | 212500.0 | 212500.0 |
| Census Tract 61.10, Mecklenburg County, North Carolina | 283666.7 | 283666.7 |
| Census Tract 61.11, Mecklenburg County, North Carolina | 271500.0 | 271500.0 |
| Census Tract 61.14, Mecklenburg County, North Carolina | 277666.7 | 277666.7 |
| Census Tract 62.23, Mecklenburg County, North Carolina | 340000.0 | 340000.0 |
| Census Tract 63.06, Mecklenburg County, North Carolina | 346100.0 | 346100.0 |
| Census Tract 63.09, Mecklenburg County, North Carolina | 445000.0 | 445000.0 |
| Census Tract 63.10, Mecklenburg County, North Carolina | 738666.7 | 738666.7 |
| Census Tract 63.11, Mecklenburg County, North Carolina | 515000.0 | 515000.0 |
| Census Tract 64.04, Mecklenburg County, North Carolina | 694000.0 | 694000.0 |
| Census Tract 9, Mecklenburg County, North Carolina | 980000.0 | 980000.0 |
Re-estimating the regression with the neighborhood Name feature.
reg.nhood <- lm(price ~ ., data = as.data.frame(M.train.sf) %>%
dplyr::select(NAME,price,sale_year,fullbaths, halfbaths,bedrooms,units,heatedarea,heatedfuel,foundation,bldggrade,shape_Leng,year,college_nn1,church_nn1,park_nn2,transit_nn3,pop,med_hh_income,belpov100,pctvacant,pctunemployed,pctgrad))
M.test.nhood <-
M.test.sf %>%
mutate(Regression = "Neighborhood Effects",
Price.Predict = predict(reg.nhood, M.test.sf),
Price.Error = Price.Predict- price,
Price.AbsError = abs(Price.Predict- price),
Price.APE = (abs(Price.Predict- price)) / price) %>%
filter(price < 2000000)
## Warning in predict.lm(reg.nhood, M.test.sf): prediction from a rank-deficient
## fit may be misleading
challenge <-
challenge %>%
mutate(Price.Predict = predict(reg.nhood, challenge))
## Warning in predict.lm(reg.nhood, challenge): prediction from a rank-deficient
## fit may be misleading
The R square of our new neighborhood model nearly in the range of 0.82-0.84. We are satisfied with the result.
stargazer(reg.nhood,
type = "text",
title="Test Set Regression Results",
keep=c("Constant "))
##
## Test Set Regression Results
## ================================================
## Dependent variable:
## ----------------------------
## price
## ================================================
## Observations 45,138
## R2 0.835
## Adjusted R2 0.834
## Residual Std. Error 107,803.500 (df = 44815)
## F Statistic 703.485*** (df = 322; 44815)
## ================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
#write.csv(challenge,"/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/predictions.csv", row.names = FALSE)
The AbsError and APE all decrease, which means the Neighborhood Effects model is more accurate on both a dollars and percentage basis.
M.test.sf$Regression <- "Baseline Regression"
bothRegressions <-
rbind(
dplyr::select(M.test.sf, starts_with("price"), Regression, NAME) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, Price.Error)),
dplyr::select(M.test.nhood, starts_with("price"), Regression, NAME) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, Price.Error)))
st_drop_geometry(bothRegressions) %>%
gather(Variable, Value, -Regression, -NAME) %>%
filter(Variable == "Price.AbsError" | Variable == "Price.APE") %>%
group_by(Regression, Variable) %>%
summarize(meanValue = mean(Value, na.rm = T)) %>%
spread(Variable, meanValue) %>%
kable() %>% kable_minimal()
## Warning: attributes are not identical across measure variables;
## they will be dropped
## `summarise()` has grouped output by 'Regression'. You can override using the
## `.groups` argument.
| Regression | Price.AbsError | Price.APE |
|---|---|---|
| Baseline Regression | 58297.31 | 0.1963437 |
| Neighborhood Effects | 44578.81 | 0.1354487 |
Predicted prices are plotted as a function of observed prices. Recall the purple line represents a would-be perfect fit, while the yellow line represents the predicted fit.
bothRegressions %>%
dplyr::select(Price.Predict, price, Regression) %>%
ggplot(aes(price, Price.Predict)) +
geom_point(size = .5,alpha = .3, colour = "grey15") +
stat_smooth(aes(price, price),
method = "lm", se = FALSE, size = 1, colour="#5252B8") +
stat_smooth(aes(Price.Predict, price),
method = "lm", se = FALSE, size = 1, colour="#D0A561") +
facet_wrap(~Regression) +
labs(title="Predicted sale price as a function of observed price", size = 15,
subtitle="Purple line represents a perfect prediction \nYellow line represents prediction") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
mm_table <- left_join(
st_drop_geometry(M.test.sf) %>%
group_by(NAME) %>%
summarize(meanPrice = mean(price, na.rm = T)),
st_drop_geometry(M.test.sf) %>%
group_by(NAME) %>%
summarize(MAPE = mean(Price.APE, na.rm = T)))
## Joining, by = "NAME"
There is no strong relationship between MAPE and average housing price, when MAPE is measured in blocks.
mm_table <- merge(acsTracts20,
mm_table,by.x = 'NAME',
by.y = 'NAME')
mm_table <- mm_table[,c('NAME','MAPE','meanPrice')]
ggplot(data=st_drop_geometry(mm_table),
aes(meanPrice, MAPE)) +
geom_point(size = .8,alpha = .7, colour = "grey15") +
labs(title = 'MAPE as a function of mean price in each census tract') +
theme(plot.title = element_text(hjust = 0.5)) +
theme_minimal()
ggplot() +
geom_sf(data = bd, fill = "grey90", color = "white", size = 0.3) +
geom_sf(data = mm_table, color = "white",size = 0.1, aes(fill = MAPE)) +
scale_fill_gradient(low = "#5252B8",
high = "#D0A561",
name = "Quintile\nBreaks") +
theme(plot.title = element_text(hjust = 0.5))+
labs(title = "MAPE of Each Census Tract") +
mapTheme()
The fit in the two partitions is basically the same, and the ideal reason is because the fit of our model effect is better.
We are not sure whether it is an ideal model, because we have calculated the Census data factors in regression model. Maybe it is due to these factors in census data has the same spatial lag. We’re still debating whether using census data for spatial generalizability test is a good choose.
The fit is relatively better in high-income areas, and perhaps not friendly to low-income groups if such a model is used for taxation and housing stock prices.
ggplot() +
geom_sf(data = na.omit(acsTracts20),
aes(fill = raceContext),
color = "white",
size = 0.4
) +
scale_fill_manual(values = c("#F3CD88", "#6A6ACD"),
name="Race Context") +
labs(title = "Percent of the White People") +
mapTheme()
bothRegressions <-
bothRegressions %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102179')
acsTracts20 <-
acsTracts20 %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102179')
st_join(bothRegressions, acsTracts20) %>%
filter(!is.na(raceContext)) %>%
group_by(Regression, raceContext) %>%
summarize(mean.MAPE = scales::percent(mean(Price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(raceContext, mean.MAPE) %>%
kable(caption = "Test set MAPE by neighborhood income context") %>% kable_minimal()
## `summarise()` has grouped output by 'Regression'. You can override using the
## `.groups` argument.
| Regression | Majority Non-White | Majority White |
|---|---|---|
| Baseline Regression | 21% | 16% |
| Neighborhood Effects | 14% | 12% |
We were satisfied with our operation of converting factor types (converting a lot of numerical data to categlorical data) and the final model showed that they served as a good fit. But there are still many interesting variables that we have not yet had time to consider. For example, we did not find a suitable data source for the crime rate. Some of the data are only available as geometry data divided by administrative areas. We did not find a good way to import them into the model.
Unfortunately, we did not have more time to steptest our model, so the prediction model still contains 23 independent variables.
According to the MAPE and mean house price plots, there is no strong association between our MAPE and mean house price, indicating that our neighborhood factor has an effect.
We would recommend this model. first we calculated our APE in the test set to be around 0.16, which is an acceptable margin of error.
When we compare the baseline model with the neighborhood model, we find that the regression curve becomes very well fitted after adding the spatial factor. the value of r-squared also remains stable at around 0.83.
And in terms of generalizability, our model has relatively strong generalizability across different kinds of partitions. However, it is worth mentioning that our model seems to fit better for high-income communities. The reason may be due to the larger amount of data in high-income communities or the richer facilities in high-income communities.
There are still some limitations in our model. If we need to create a super model that is broadly applicable to many types of communities, firstly, we are going to find the best predictive features or variables. Secondly, we have to try to inject enough predictive power into the model to make good predictions without over-fitting.